Biometrika (2009), xx, x, pp. 1-9 
© 2007 Biometrika Trust 
Printed in Great Britain 



Adaptive approximate Bayesian computation 

By Mark A. Beaumont 

School of Biological Sciences, University of Reading, PO Box 68, Whiteknights, Reading RG6 

6BX, U.K. 
m.a.beaumont@reading.ac.uk 

Jean-Marie Cornuet 
Department of Epidemiology and Public Health, Imperial College, London SW7 2AZ, U.K. 

jmcomuet@ensam.inra.fr 

Jean-Michel Marin 

Institut de Mathematiques et Modelisation de Montpellier, Universite Montpellier 2, Case 
Courrier 51, 34095 Montpellier cedex 5, France 
Jean-Michel.Marin@univ-montp2.fr 

Christian P. Robert 

Centre de Recherche en Mathematiques de la Decision, Universite Paris Dauphine, 75775 

Paris cedex 16, France 
xian@ceremade.dauphine.fr 

Summary 

Sequential techniques can enhance the efficiency of the approximate Bayesian computation algorithm, 
as in Sisson et al.'s (2007) partial rejection control version. While this method is based upon the theoret- 



ical works of |Del Moral et aL ( 2006| l, the application to approximate Bayesian computation results in a 



bias in the approximation to the posterior An alternative version based on genuine importance sampling 



arguments bypasses this difficulty, in connection with the population Monte Carlo method of Cappe et al. 
( |2004) l, and it includes an automatic scaling of the forward kernel. When applied to a population genetics 
example, it compares favourably with two other versions of the approximate algorithm. 

Some key words: Markov chain Monte Carlo; partial rejection control; importance sampling; sequential Monte Carlo. 



1. Introduction 



When the likelihood function is not available in a closed form, as in population genetics, Pritchard et al. 
([1999 ) introduced approximate Bayesian computational methods as a rejection technique bypassing the 
computation of the likelihood function via a simulation from the corresponding distribution. Namely, if 
we observe y ^ f{y \ 6) and if 7r(6') is the prior distribution on the parameter 0, then the original approx- 
imate Bayesian computation algorithm jointly simulates 6' ^ tt{9) and x ^ f{x \ 9'), and accepts the 
simulated 6' if, and only if, the auxiliary variable x is equal to the observed value, x — y. This algorithm 
is exact in that the accepted 9"s are distributed from the posterior In the more standard occurrence when 
2/ is a continuous random variable, the approximate Bayesian computation algorithm relies on an approx- 
imation, where the equality x = y is replaced with a tolerance condition, g{x, y) < e, g being a measure 
of discrepancy, for instance a distance between summary statistics, and e > being the tolerance bound. 
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The output is then distributed from the distribution with density proportional to tt{9) pTg{g{x, y) < e}, 
where pr^ represents the probability distribution of x indexed by a value of 6. This density is denoted 
by tt{9 I g{x, y) < e}, where the conditioning corresponds to the marginal distribution of g{x, y) given 
y. Improvements to this general scheme have currently been achieved either by modifying the proposal 



distribution of the parameter 6 to increase the density of x's within a neighbourhood of y (Marjoram et al 



2003 Bortot et al. 2007| l or by viewing the problem as a conditional density estimation and developing 
techniques to allow for larger e ( [Beaumont et al.| [ 2002| ). 

Marjoram et al. ( 2003| l defined a Markov chain Monte Carlo version of the approximate Bayesian 



computation algorithm that enjoys the same validity as the original algorithm, namely that, if a Markov 
chain (S*^*') is created via the transition that sets equal to a new value e' - K{e' \ 6'(*)) only if 

x^ f lx\ 0') issuchthata; = ?/andifM~Z^(0,l) < Tr{d')K{9^*'> \ 0') / TT{e'^*^)K{e' \ 6'(*)), then its sta- 
tionary distribution is the posterior distribution 7r(0 | y). Since, in most settings, the distribution of y is 
absolutely continuous, the above constraint x — y\s replaced with the approximation q{x, y) < e. 



10, 10) and X I 61 ~ 
is the 



Example 1. For the toy model studied in Sisson et al. (2007i, where 6 ^ U{ 
• 5Af{6, 1) + • 5^/(9, 1/100), the posterior distribution associated with the observation y 
normal mixture 

6» I J/ = - • 57V(0, 1) + • 57V(0, 1/100) 

restricted to the set [—10,10]. As in regular Markov chain Monte Carlo settings, the performance of 
Marjoram et al.'s (2003) algorithm depends on the choice of the scale r in the random walk proposal, 
K{9' I 9) = t^^(p{t^^{9 — 9')}, where (p is most often a standardised normal or a f density. However, 



even when t = • 15 as in [Sisson et al. ( 2007| l, the Markov chain mixes slowly, but still produces an 
acceptable fit over T = 10^ iterations. Furthermore, the true target is available here as 



tt{9\ I X |< e) cx $(e- 



$(-£ -9) + ${10(e - 61)} - ${-10(e + 9)} 



and, for the value e = • 025, it is indistinguishable from the exact posterior density tt{9 \ y = 0). We 
can thus clearly separate the issue of poor convergence of the algorithm, depending on t, from the issue 
of approximating the posterior density, depending on e. 



jSisson et al.j ( [2007) 1 modified the approximate Bayesian computation algorithm using partial rejection 
control, introduced in Liu (2001|l. The method is sequential in that simulated populations of TV points 



are generated at each iteration of the algorithm and that those populations are exploited to produce better 



proposals for a given target distribution in later iterations. As demonstrated in Douc et al. (2007 1, the re- 



liance on earlier populations to build proposals preserves convergence properties provided an importance 
sampling perspective is adopted. We also recall that a progressive improvement in the performance of 
proposals is the appeal of using a sequence of samples, rather than a single one. The sequential feature of 
the problem is augmented by the fact that the tolerance e is decreasing with iterations, being chosen either 
from a deterministic scale or from quantiles of earlier iterations as in Beaumont et al. ( 2002) l. 



Sisson et al.'s (2007) algorithm produces samples (6'f , • • • , ^j^'' ) by using, at iteration t — 1, a regular 
approximate Bayesian computation step and, at each iteration t — 2, . . . ,T, Markov transition kernels Kt 



for the generation of the 9'f^'s, namely 9f ^ 



Kt{9 I 9*) , until X - /(x I 9 



(t) 



where 9* is selected at random among the previous 9f 's with probabilities ujf 
probability ujf^ is derived by an importance sampling argument. 



) is such that q{x, y) < e, 
{i= 1,..., TV). The 



ujI cx Tr{9l ')Lt- 



,{9* 1 0f)){7r(r)i^,(0f' I r)}- 



(1) 



where Lt-i is an arbitrary transition kernel. In their examples, [Sisson et al. (20071 use Lt-i{9' \ 9) 
= Kt(9 I 9'), which means that the weights are all equal under a uniform prior. The ratio ([T]l is inspired 
from [Del Moral et"ar ( 2006) who use a sequence of backward kernels Lt_i in a sequential Monte Carlo 



algorithm to achieve unbiasedness, up to the renormalisation effect, in the marginal distribution of the 
current value without computing this intractable marginal. 
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97 In this paper, we show via both theoretical and experimental arguments that the weight ([TJ is biased. 

98 Moreover, we introduce a population Monte Carlo correction to the algorithm. It is based on genuine 

99 importance sampling arguments and we demonstrate its applicability as well as the improvement it brings 

100 compared with the partial rejection control version. We also illustrate its efficiency in a population genetics 

101 example, when compared with two standard alternatives. 
102 

103 

104 2. Bias of the partial rejection control version 

2-1. Distribution of the partial rejection control sample 

jQ-y In order to establish the bias in the weight ([T]i as clearly as possible, we consider the limiting case when 

e = 0. In that case, both Pritchard et al.'s (1999) and Marjoram et al.'s (2003) algorithms are correct sam- 
plers from t:{9 \ y). The corresponding partial rejection control version selects a from the previous 
sample and then generates both 6' ^ Kt{9 \ 6*) and x ^ f{x \ 9') until x — y. To properly evaluate the 
bias associated with one step of Sisson et al.'s (2007) algorithm, we also assume that the previous sample 
is correctly generated from the target, i.e. that 9* ^ 7r{9 \ y). Then, denoting by the selected 9*, the 
joint density of the accepted pair(6l(*~i),6l(*')is proportional to 7r(6'(*-i) | y)Kt{9^*^ \ 6'(*"^))/(y | 9^^''), 
where the normalisation constant only depends on y. Using ([TJ, the weighted distribution of 0*^*-' is such 
that, for an arbitrary integrable function h{9), E{ujth{Q^^^)} is proportional to 



108 
109 
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113 
114 
115 

116 r r . . ^c/5(t)^r. 

117 
118 



121 X I 9^'-^^)f{y I 0(*))d0(*~i)d0(*) 

122 

123 a / \ y) \ [ Lt-i{9^'-^^ \ 9^'^)f{y \ 9^'-^^)de^''^^ \ d^^ 



124 ^-^ 

125 with all proportionality terms being functions of y only. Therefore we can conclude that there is a bias 

126 in the weight ([T]) unless the inner function integrates in ^'^^^^ to the same constant for all values of 

127 0(t). Apart from this special case, which is achievable when Lt-i{9^^~^^ \ 9^^^) = g{9^*~^^) but not in 

128 the random walk type proposal when Lt_i(6'*^*^^) | 9^^^) — tp{9^*^^'> — 0'^*^), ([TJ is incorrect since the 

129 weighted output is not distributed from Tr{9 \ y). 

130 Paradoxically, the weight Q used in this partial rejection control version misses a /(y | f?'*^^)) term 

131 in its denominator, while the method is used when f{y \ 9) is not available. This is exactly the difference 

132 between the weights used inl Sisson et al.|p007 1 and those used in Del Moral et al. (2006 1, namely that, in 



133 the latter paper, the posterior 'k{9'^*~^^ \ y) explicitly appears in the denominator instead of the prior The 

134 accept-reject principle at the core of approximate Bayesian computation allows for the replacement of the 

135 posterior by the prior in the numerator of the ratio (fill, but not in the denominator. 
136 

137 2-2. A mixture illustration 

1 38 

When using the standard version of the algorithm that includes an additional approximation due to the 
tolerance zone y) < e, there is no reason for the bias to vanish, even though experiments show that 
the bias in the weights ([T]i for the tolerance target Tr{9 \ g{x, y) < e} generally decreases as e increases, 
which agrees with the fact that the limiting case is the prior. The following example illustrates this point: 

142 

143 Example 1 . For the mixture target and a normal random walk kernel Kt, the first row of Figure[T]shows 

144 the output of five consecutive iterations of Sisson et al.'s (2007) algorithm, using a decreasing sequence of 
et's, from ei = 2 down to 65 = • 01, and a standard deviation in Kt equal to r = • 15. Clearly, using 
this algorithm leads to a bias in terms of the tolerance target for all values of et, the tails being poorly 
covered. In contrast, using a much larger r = 1/0 • 15 results in a good fit of the tolerance target, as 
shown by Figure 2 in Sisson et al. ( 2007[ l. This fit does not contradict the bias exhibited above since using 
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Fig. 1. Histograms of the weighted samples of size Ad — 
5 X 10'^ produced by five consecutive uses of |T]( on the 
first row and the population Monte Carlo version on the 
second row, for the mixture target, with et = 2, 1.5, 1,0- 
5, ■ 01, using for both rows the scale r = ■ 15. The ex- 
act posterior density is plotted as a dotted curve, while the 
target of the simulation algorithm, n{9 | g{x, y) < e}, is 
represented by full lines. Both densities are identical in the 
last column. 



a large scale in the proposal Kt very closely amounts to using a flat prior distribution. The corresponding 
normal density is then almost constant in the interval [—3, 3] that supports the posterior distribution. For 
T = 1/0 • 15, using the weight Q is thus equivalent to Pritchard et al.'s (1999) original algorithm and 
therefore as inefficient for this target. 



3. Correction via importance sampling and population Monte Carlo 

3 1. Population Monte Carlo 

Since the missing factor in ([T]) is the unknown likelihood f{x \ 6*^*^^^), a first resolution of the problem 
is to resort to an estimation of the likelihood based on earlier samples. But a standard importance sampling 
perspective allows for a more direct approach, in a spirit similar to the population Monte Carlo algorithm 



of Cappe et al. (2004i. 
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193 Since the t-th iteration sample is produced from the proposal distribution 

194 



N 



195 ^.rfl(*))oc^c.f-i)i^,(0« l^f'i)), 



which does not depend on the distribution tt of the o'f ^'''s. As in the original population Monte Carlo 



196 
197 

198 a natural importance weight associated with an accepted simulation of^ is 
199 
200 

201 The unbiasedness of this correction results from the identity 

(t) 

204 J 77(0'-*)) 
205 
206 

method, the fact that Kt may depend on simulations from earlier iterations does not jeopardise the validity 
of the method. In addition, |Douc et aL| ( |2007| l proved that Kt must be modified at each iteration for the 
iterations to bring an asymptotic improvement on the Kullback-Leibler divergence between the proposal 

2j^Q TCt and the target: for instance, if the variance of the random walk does not change from one iteration to 

2 J the next, the approximation of the target by itt does not change either and it is then more efficient to run a 

2j^2 single iteration with twice as many points. Since ttj is an approximation to the distribution of the sample 

2j^2 simulated at iteration t, when marginalised against all previous samples, this version of the approximate 

2j^^ Bayesian computation algorithm can be interpreted as an approximate version of the sequential Monte 

2 Carlo algorithm of Del Moral et al.|p006[ ), when using the optimal backward kernel. 

2 When considering component- wise independent random walk proposals, 

218 

219 for each component 6'[*-' of the parameter vector 0^*^, the asymptotically optimal choice of the scale factor 

220 Tfc is available. Indeed, when using a Kullback-Leibler measure of divergence between the target and the 

221 proposal, 

222 / 

223 E I log 

224 V 
225 

22g where the expectation E is taken under the product distribution {6'^*\ 6'^*"^^) ~ 7r(0(*) | y) x 7r(0(*~^^ | 

22y y) , the minimisation of the KuUback divergence leads to the component-wise maximisation of 

228 E[log V{Tfc"^(6'l,'^ - 6i['~^^)}] under the product distribution 7r(6'(*) | y) x 7r(6l(*-i) | y). The optimal 

229 scale is then equal to i?{(6'^*^ — 6*^* ^■')^}, that is, t| = 2var(0fe | y) , under the posterior distribution. The 

230 implementation of this updating scheme on the scale is straightforward. 

23 1 The algorithmic rendering of the corresponding optimised population Monte Carlo scheme is thus as 

232 follows: 

233 Population Monte Carlo approximate Bayesian computation: 
234 
235 

236 1. At iteration t = 1, 

237 Fori = l,...,iV 

238 Simulate O^^^ ~ tt{9) and x ~ f{x \ 6lf ') until g{x, y) < ei 

239 Seta;f' = l/iV 

240 Take as twice the empirical variance of the d[^^ 's 
2. At iteration 2 <t <T, 

For i = 1, TV, repeat 



I y) m-kMr^\oi'^ - et'^)}f{y I ^(*)) 



Given a decreasing sequence of tolerance thresholds ei > • • • > et. 



Pick 9* from the S^' 's with probabilities o;^* ^' 
generate ] 9* ~ JV{9:,t?) and x f{x \ 9f^) 
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241 until g{x, y) < €t 

242 Set^W oc.(^f))/Ef..-f-V{rr^ (^f -^f-^^} 
Take rt+i as twice the weighted empirical variance of the ^^^^ 's 

The expression of the importance weight uj'f '' involves a sum of N terms and, therefore, the computa- 



tional cost of this step of the population Monte Carlo approximate Bayesian computation algorithm is in 
0{T N^). However, in realistic applications of approximate Bayesian computations, the main cost is as- 
sociated with the previous step, namely the repeated simulations from the sampling density. For instance, 
in the population genetics example, the algorithm spends at least 95% of the computing time in the repeat 
loop and less than 5% of the time on the remaining computations. 

3-2. A mixture illustration 



246 
247 
248 
249 
250 
251 
252 
253 

254 Example 1 . For the mixure target, using the population Monte Carlo version leads to a recovery of the 

255 target, whether using a fixed standard deviation t • 15 as shown on the second row of Figure [T] or 
255 T = 1/0 • 15, or a sequence of adaptive t^'s as in the above algorithm. The graphical difference between 

257 these implementations is genuinely difficult to spot; the estimated variance stabilises very quickly in this 

258 example. 
259 

2(3Q 3-3. A population genetics example 

261 This example considers a simple evolutionary scenario of two populations having diverged from a 

262 common ancestral population. Data consists of the genotypes at five microsatellite loci of 50 diploid 

263 individuals from each of the populations. Loci are assumed to evolve according to the strict stepwise 

264 mutation model: when a mutation occurs, the number of repeats of the mutated gene increases or decreases 

265 by one unit with equal probability. Once diverged, populations do not exchange gene and there is no 

266 migration. The natural parameters of this model are the three effective population sizes, A^i, N2 and Nana, 

267 the time of divergence (tiHy) and the mutation rate {^) assumed here to be common to all loci. While the 

268 analyses will be performed with these natural parameters, only identifiable combinations of those, such as 

269 the three parameters 9i = ANi^, 62 — 4Af2M ™d 9a = ANanclJ-, and the parameter r^ii, = tdivfJ-, will be 

270 considered in the output. 

271 In this experiment, several simulated datasets have been produced using the software developed by 



272 [Cornuet et aLlpOO g), with the following parameter values: A^i = Nanc = 10, 000, N2 = 2, 000, tdiv = 

273 1, 000 and ^ = • 0005, out of which one is presented in this paper, but identical conclusions were drawn 

274 from the other datasets. 

275 The tolerance region {g{x,y) < e} used in the approximate Bayesian computation schemes is based 

276 on twelve summary statistics as in Cornuet et al. (2008), namely mean number of alleles, mean genie 

277 diversity, mean size variance and mean Garza- Williamson M index for each population sample, Fst and 

278 (^m)^ distances between population samples, and mean probability of assignment of each sample to the 

279 other population. The distance g{x, y) is then chosen as the Euclidean distance between the observed 

280 and the simulated summary statistics, normalised by their standard deviation under the predictive model. 

28 1 The derivation of the distance thus requires a preliminary evaluation of those standard deviations for each 

282 summary statistic by simulation. 

283 Each dataset has been submitted to three parallel analyses: a standard approximate Bayesian computa- 

284 tion analysis following Beaumont et al.'(2002'), performed via Cornuet et al.'s (2008) software, a tempered 



285 Markov chain Monte Carlo analysis as described in Bortot et al. ( 2007), and a population Monte Carlo 

286 analysis. In each case, the same prior distributions have been used, consisting in a J7[10^ , 10^] prior on the 

287 three effective sizes, a U[10, 10^] prior on the time of divergence, and a U[10^'^, 10^^] prior on the muta- 

288 tion rate. In addition, since Beaumont et al.'s (2002) algorithm includes a final local regression adjustment, 
this adjustment has been performed on the output of both alternative algorithms. 

In order to operate a fair comparison between the three methods, the computing times are kept identi- 
cal. Thus, reference posterior distributions have been constructed with Beaumont et al.'s (2002) approach, 
using = 3 X 10^ simulated datasets and choosing e as the • 01 quantile of the distances. For Bortot et 
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Fig. 2. Variability of the different alternative schemes eval- 
uated though five replicates of the density estimates of the 
posterior distributions of four identifiable parameters of the 
population genetics model. First line: population Monte 
Carlo version; second line: tempered Markov chain Monte 
Carlo version; third line: Beaumont et al.'s (2002) ver- 
sion of the approximate Bayesian computation algorithm, 
against the reference posterior, obtained by 5 x W sim- 
ulations in Beaumont et al.'s (2002) version (dotted line). 
The vertical lines identify the true values of the parameters. 



al.'s (2007) approach, the tuning parameter driving the acceptance threshold e is drawn from an exponen- 
tial prior with parameter equal to the mean distance of the ten closest datasets (among 5000 generated in a 
pilot simulation), followed by 23, 000 regular iterations with a thinning factor of 23. Each Markov move 
combined two simultaneous updates : an independent draw of e from its exponential prior and a random 
walk based on a log-normal deviate with standard deviation equal to • 3 of one of the other parame- 
ters chosen uniformly at random, the latter involving the computation of a Hastings term. Lastly, in the 
population Monte Carlo version, ei is based on the preliminary simulation as the • 1 quantile and four 
iterations are performed with £2=0- 75ei, £3 = 0- 9e2 and £4 = 0- 9e3, and with truncated normals in 
the random walk. Each iteration is based on samples of size 10"^. All three versions then require an aver- 
age 4 • 5 minutes. Figure |2] summarises the output of the comparison between the three methods over five 
independent runs for the same simulated dataset and it shows that, at least in this example, the posterior 
evaluations based on population Monte Carlo are more stable than Bortot et al.'s (2007) algorithm, while 
comparable with Beaumont et al.'s (2002). 
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337 4. Conclusion 

1.1.0 

While Sisson et al.'s (2007) algorithm induces biased weights, with a visible impact on the quality of 

the approximation, we have shown that the same Markov transition kernels and thus the same computing 

power can be used to produce an unbiased scheme. 

The population Monte Carlo scheme is based on an importance argument that does not require a back- 

ward kernel as in ([T]). We have thus established that the adaptive scheme of Cappe et al.| ( |2008| l is also 

appropriate in this setting, towards a better fit of the proposal kernel Kt to the target Tr{6 \ q{x, y) < e}. 

^^"^ From a practical point of view, the number of iterations T can be controlled via the modifications in the 

^'^^ parameters of Kf, a stopping rule being that the iterations should stop when those parameters have settled, 

^'^^ while the more fundamental issue of selecting a sequence of e^'s towards a proper approximation of the 

^'^'^ true posterior can rely on the stabilisation of the estimators of some quantities of interest associated with 

^'^^ this posterior. 
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